There are many packages in R for network analysis.
The main packages grew up around modeling/analysis frameworks with core authors.
This tutorial is organized by package, with examples.
It covers the major packages, but is biased toward my experience.
Analysis packages:
Visualization packages:
Please ask questions about anything you don’t understand, including basic R questions.
Before we start make sure you’ve downloaded the repository here: https://github.com/jcarlen/network_analysis_in_R
And set your working directory to be the top folder of the repository.
# list of major packages once here so you can make sure they're installed
# Most are listed again where first used
# Network-related
library(statnet)
library(igraph)
library(latentnet)
library(visNetwork)
library(ggraph)
library(netCoin)
# Other
library(dplyr)
library(ggrepel)
library(lubridate)
library(stringr)
library(kableExtra)
library(scales)
# Working directory - update for your system
wd = "~/Documents/network_analysis_in_R"
statnet
A family of packages for all stages of network analysis, based on exponential random graph models (ERGMs).
statnet (version 2018.10): Butts (2008), Butts (2015), Hunter et al. (2008), Handcock et al. (2018)
Depends: R (≥ 3.0), tergm (≥ 3.5.2), ergm.count (≥ 3.3), sna (≥ 2.4), tsna (≥ 0.2)
Imports: ergm (≥ 3.9.4), network (≥ 1.13), networkDynamic (≥ 0.9), statnet.common (≥ 4.1.4)
Authors: Handcock, Hunter, Butts, Goodreau, Krivitsky, Bender-deMoll, Morris.
Roots in social networks, sociology, epidemiology - smaller networks with lots of features
Exponential Random Graph Model
\[P(Y=y; X,N,\beta) = \frac{\exp({\beta_1t_1+\beta_2t_2+...+\beta_pt_p})}{C(X,N,\beta)}\]
Describes the probability of a network Y
Maximum entropy distribution on graph space with given expected statistics
Simple to express, may be hard to implement
SO many graphs -> C is intractable, a sum over \(O(2^{n^2})\) graphs for binary networks
MPLE or MCMC-MLE for inference. Explore the parameter space and find optimal values.
DEGENERACY
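To get a feel for why C is intractable: the number of simple undirected binary graphs on n nodes is 2^choose(n, 2), which explodes quickly. A base R sketch on toy sizes:

```r
# Number of possible simple undirected binary graphs on n nodes: 2^choose(n, 2).
# Summing over all of them to compute the normalizing constant C is infeasible.
n.graphs <- function(n) 2^choose(n, 2)

sapply(c(5, 10, 14), n.graphs)
# for n = 14 (the size of our judge network) there are already ~2.5e27 graphs
```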
ERGM are generally best-suited for smaller networks with interesting attributes that we want to explore in relation to the network structure.
Providers of this data are interested in:
We’ll want to control for characteristics like party of appointing president.
Data starts in 1994 because that is the first year with more than one female Supreme Court justice.
It is common for our data to start as one or more CSV files. Edges and node attributes may be in separate files. We need to convert this to network data.
sc.edges = read.csv(file.path(wd, "data/GSoCversionOpinions-EDGES.csv"))
sc.nodes = read.csv(file.path(wd, "data/GSoCversionOpinions-NODES.csv"))
summary(sc.edges)
## Source Target Opinion.information Type
## Min. : 1.000 Min. : 15.0 Length:578 Length:578
## 1st Qu.: 4.000 1st Qu.: 61.0 Class :character Class :character
## Median : 8.000 Median :102.0 Mode :character Mode :character
## Mean : 7.351 Mean :103.6
## 3rd Qu.:10.000 3rd Qu.:149.0
## Max. :14.000 Max. :190.0
## ID Case.ID Case.name Case.date
## Min. : 1.0 Min. : 1.00 Length:578 Min. :1994
## 1st Qu.:145.2 1st Qu.:13.00 Class :character 1st Qu.:1996
## Median :289.5 Median :25.00 Mode :character Median :2002
## Mean :289.5 Mean :24.39 Mean :2003
## 3rd Qu.:433.8 3rd Qu.:38.00 3rd Qu.:2010
## Max. :578.0 Max. :45.00 Max. :2017
summary(sc.nodes)
## ID Label Gender Date.of.birth
## Min. : 1.00 Length:190 Length:190 Length:190
## 1st Qu.: 48.25 Class :character Class :character Class :character
## Median : 95.50 Mode :character Mode :character Mode :character
## Mean : 95.50
## 3rd Qu.:142.75
## Max. :190.00
## Religion..specific. Religion..broad. Seniority Appointing.President
## Length:190 Length:190 Length:190 Length:190
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## University.affiliation Role Year.started.role
## Length:190 Length:190 Length:190
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
We discover that this is a bipartite network - source and target don’t overlap.
The edge data includes detailed opinion information which is captured by the “Target” number and “Opinion.information” variables. Let’s condense this slightly to the Main, Concurrence, Dissent etc. labels contained in the “Type” variable. We will do this by grouping the Target data by case name and type. This reduces our decision nodes from 176 to 113.
# Condense edges to coarser decision type
sc.edges.condensed = sc.edges %>% group_by(Source, Case.name, Type) %>% summarize(year = first(Case.date))
sc.edges.condensed$Target = sc.edges.condensed %>% group_by(Case.name, Type) %>% group_indices()
head(sc.edges.condensed, 10)
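As an aside, the group_indices() step above can be mimicked in base R by matching each (Case.name, Type) key against its unique values. A toy sketch (note: unlike dplyr, the ids here follow order of appearance rather than sort order):

```r
# Base R sketch of what group_indices() does: assign one integer id
# per unique (Case.name, Type) pair. Toy data, not the real edge list.
edges <- data.frame(Case.name = c("A", "A", "B", "B", "A"),
                    Type      = c("Main", "Dissent", "Main", "Main", "Main"))
key <- paste(edges$Case.name, edges$Type, sep = "||")
edges$Target <- match(key, unique(key))
edges$Target  # rows sharing a case and type get the same id
```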
library(statnet) # loads statnet, tsna, sna, ergm.count, statnet.common, tergm, networkDynamic, ergm, network
n.judges = 14
n.decisions = max(sc.edges.condensed$Target)
sc.edges.condensed$Target = sc.edges.condensed$Target + as.integer(n.judges) #shift these indices up to avoid overlap with Source
sc.bipartite = network(sc.edges.condensed[,c("Source", "Target")],
bipartite = TRUE,
directed = FALSE,
vertex.attr = list(vertex.type = c(rep(2,n.judges), rep(1, n.decisions)),
year = c(rep(2,n.judges), gray(1 - (sc.edges.condensed$year - 1994)/23)),
name = c(rep(2,n.judges), sc.edges.condensed$Case.name),
label = c(as.character(sc.nodes$Label), rep("", n.decisions)),
decision.type = c(rep(1,n.judges), as.numeric(as.factor(sc.edges.condensed$Type))))
)
#outer brackets prevent "plot.new has not been called yet" error
{plot(sc.bipartite,
vertex.sides = 2+(as.numeric(sc.bipartite%v%"vertex.type"))^4,
vertex.cex = sc.bipartite%v%"vertex.type",
vertex.col = sc.bipartite%v%"decision.type",
label = sc.bipartite%v%"label",
main = "Bipartite Network of First Amendment Decisions 1994-2017", edge.col = "gray",
label.col = 4, label.cex = 1.2, label.pos = 3)
legend("topleft", legend = c("Judge", "Decision"), col = 1, pch = c(19,2), cex = .7)
legend("bottomleft", legend = sort(unique(sc.edges.condensed$Type))[-1], col = 2:5, pch = 17, cex = .7)}
# project to adjacency matrix of judges
sc.projection = (as.sociomatrix(sc.bipartite) %*% t(as.sociomatrix(sc.bipartite)))[1:n.judges,1:n.judges]
# slight discrepancy between appearance of judges in edgelist and this projection due to
# duplicate edges being listed when a decision was split in two (e.g. "part" and "rest")
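The projection line above is easier to see on a toy incidence matrix: if B has one row per judge and one column per decision, B %*% t(B) counts co-decisions between judges, with each judge's own decision count on the diagonal. A minimal sketch, not the real data:

```r
# Bipartite projection on a toy 3-judge, 3-decision incidence matrix
B <- matrix(c(1, 1, 0,    # judge 1 on decisions 1, 2
              1, 0, 1,    # judge 2 on decisions 1, 3
              0, 1, 1),   # judge 3 on decisions 2, 3
            nrow = 3, byrow = TRUE)
P <- B %*% t(B)
P  # off-diagonal: shared decisions; diagonal: decisions per judge
```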
# Add attributes and make it a network ----
library(lubridate) #for "year" function
sc.nodes$birthyear = year(as.Date(as.character(sc.nodes$Date.of.birth), format = "%m/%d/%Y"))
sc.attr.list = list(label = c(as.character(sc.nodes$Label)),
appointed = as.numeric(as.factor(sc.nodes$Appointing.President[1:n.judges])),
appearances = diag(sc.projection),
religion = as.character(sc.nodes$Religion..broad.[1:n.judges]),
gender = as.character(sc.nodes$Gender[1:n.judges]),
role = as.character(sc.nodes$Role[1:n.judges]))
sc.unimode = network(sc.projection, directed = F, names.eval = "codecisions", ignore.eval = F,vertex.attr = sc.attr.list)
# variable for plotting shapes -- note use of %v% to access vertex attributes
religion.pch = 2 + as.numeric(as.factor(sc.unimode%v%"religion")); religion.pch[religion.pch>4] = 50
{plot(sc.unimode, edge.lwd = "codecisions",
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Connections of Judges in 1st Amendment Decisions")
legend("topleft", legend = c("Dem. Appoint", "Rep. Appoint"), col = c(4,2), pch = 19, cex = .7)
legend("bottomleft", legend = levels(as.factor(sc.unimode%v%"religion")), col = 1, pch = c(17,18,19), cex = .7)}
Visualization tells us some things about the network, but there’s a limit on the number of things we can visualize at one time.
Statnet has a few layout options, which can change how we see the network.
PLOTTING HAS RANDOMNESS. We can use set.seed to fix a random location, but the randomness can impact interpretation.
Note how centrality and politics (dem vs. repub) affect the layout.
We want to analyze the effects of some factors while controlling for others.
We’ll use the unimodal network for modelling. It’s small, gets at our questions of interest, and is relatively easy to interpret.
#?"ergm-terms" is your friend
sc.unimode%v%"birthyear" = sc.nodes$birthyear[1:n.judges]
sc.ergm = ergm(sc.unimode ~ edges + nodematch("gender") + nodematch("religion") +
nodematch("appointed") + nodecov("birthyear") + absdiff("birthyear") +
nodefactor("role"))
summary(sc.ergm)
## Call:
## ergm(formula = sc.unimode ~ edges + nodematch("gender") + nodematch("religion") +
## nodematch("appointed") + nodecov("birthyear") + absdiff("birthyear") +
## nodefactor("role"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges 210.82281 112.74243 0 1.870 0.061491 .
## nodematch.gender -0.18597 0.88816 0 -0.209 0.834145
## nodematch.religion 17.84383 1817.39403 0 0.010 0.992166
## nodematch.appointed 0.07821 0.87046 0 0.090 0.928409
## nodecov.birthyear -0.05317 0.02886 0 -1.842 0.065423 .
## absdiff.birthyear -0.15506 0.04340 0 -3.573 0.000353 ***
## nodefactor.role.CJ -0.64094 0.72492 0 -0.884 0.376610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 126.15 on 91 degrees of freedom
## Residual Deviance: 48.06 on 84 degrees of freedom
##
## AIC: 62.06 BIC: 79.63 (Smaller is better. MC Std. Err. = 0)
# What is the relationship between birth year and degree in the network?
# Not super strong though noticeable at the tails
plot(sc.unimode%v%"birthyear", sc.unimode%v%"appearances")
summary_formula(sc.unimode ~ density())
## density
## 0.8021978
A density of about 80% is high for a social network. In our current network, which is binary, two judges are connected if they share any decision. This may be masking differences between them. We also saw from the importance of birth year that our results are skewed by how much judges overlapped in their appointments.
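As a quick sanity check on that figure: the density of an undirected simple network is edges / choose(n, 2), so with 14 judges the 0.802 corresponds to 73 of the 91 possible ties being present:

```r
# Density of an undirected simple network is edges / choose(n, 2).
n <- 14
possible <- choose(n, 2)   # 91 possible ties among 14 judges
73 / possible              # ~0.802, matching the summary_formula output
```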
Let’s try normalizing and thinning the network using a percentage-wise cutoff for edges to be included. This is a common strategy, but we should apply it carefully because it affects our model. Ideally we would use a cutoff that is meaningful in context.
# try thinning the edges
sc.projection.norm = sc.projection
diag(sc.projection.norm) = 0
sc.projection.norm = sc.projection.norm/rowSums(sc.projection.norm)
hist(sc.projection.norm, breaks = 40) # histogram helps choose a cutoff
sc.projection.norm[sc.projection.norm < .1] = 0
sc.unimode.thin = network(sc.projection.norm, directed = F, vertex.attr = sc.attr.list)
sc.unimode.thin%v%"birthyear" = sc.nodes$birthyear[1:n.judges]
plot(sc.unimode.thin,
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
#vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Network of 1st Amendment Decisions")
# Fit an ERGM to the thinned network
sc.ergm.thin = ergm(sc.unimode.thin ~ edges +
nodematch("gender") +
nodematch("religion") +
nodematch("appointed") +
nodecov("birthyear") +
absdiff("birthyear") +
nodefactor("role"))
summary(sc.ergm.thin)
## Call:
## ergm(formula = sc.unimode.thin ~ edges + nodematch("gender") +
## nodematch("religion") + nodematch("appointed") + nodecov("birthyear") +
## absdiff("birthyear") + nodefactor("role"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges 129.82151 62.74212 0 2.069 0.0385 *
## nodematch.gender -0.15126 0.60466 0 -0.250 0.8025
## nodematch.religion 1.76462 0.69246 0 2.548 0.0108 *
## nodematch.appointed 1.30574 0.61572 0 2.121 0.0339 *
## nodecov.birthyear -0.03340 0.01615 0 -2.067 0.0387 *
## absdiff.birthyear -0.06528 0.02536 0 -2.574 0.0101 *
## nodefactor.role.CJ -0.37203 0.57018 0 -0.652 0.5141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 126.15 on 91 degrees of freedom
## Residual Deviance: 91.33 on 84 degrees of freedom
##
## AIC: 105.3 BIC: 122.9 (Smaller is better. MC Std. Err. = 0)
# Results are more interesting; now remove non-significant terms
sc.ergm.thin = ergm(sc.unimode.thin ~ edges +
#nodematch("gender") +
nodematch("religion") +
nodematch("appointed") +
nodecov("birthyear") +
absdiff("birthyear"))
#nodefactor("role")
summary(sc.ergm.thin)
## Call:
## ergm(formula = sc.unimode.thin ~ edges + nodematch("religion") +
## nodematch("appointed") + nodecov("birthyear") + absdiff("birthyear"))
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges 132.78812 61.39518 0 2.163 0.03055 *
## nodematch.religion 1.77928 0.68848 0 2.584 0.00976 **
## nodematch.appointed 1.17761 0.53993 0 2.181 0.02918 *
## nodecov.birthyear -0.03419 0.01581 0 -2.162 0.03065 *
## absdiff.birthyear -0.06708 0.02512 0 -2.670 0.00757 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 126.15 on 91 degrees of freedom
## Residual Deviance: 91.87 on 86 degrees of freedom
##
## AIC: 101.9 BIC: 114.4 (Smaller is better. MC Std. Err. = 0)
“We do not have Obama judges or Trump judges, Bush judges or Clinton judges. What we have is an extraordinary group of dedicated judges doing their level best to do equal right to those appearing before them.” -Chief Justice John Roberts (11-21-2018)
How can we tell if the fit is good?
The gof function of ergm lets us simulate networks from the fit model and visualize how they compare to the observed network with a built-in plot method. It defaults to comparing along several important graph statistics.
sc.ergm.thin.gof = gof(sc.ergm.thin)
plot(sc.ergm.thin.gof)
# Inspect simulations further
summary_formula(sc.unimode.thin ~ triangles()) #summary_formula replaced summary.statistics
## triangle
## 74
summary_formula(simulate(sc.ergm.thin) ~ triangles())
## triangle
## 80
# For a more holistic picture, compare a simulated network to the observed one
par(mfrow = c(1,2))
par(mai = rep(.2, 4))
plot(sc.unimode.thin,
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
#vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Network of 1st Amendment Decisions", label.cex = .5)
plot(simulate(sc.ergm.thin),
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
#vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Simulation from model", label.cex = .5)
None of the models we’ve fit so far have required MCMC-MLE, due to dyad-independence.
# To force MCMC
sc.ergm.thin.mle = ergm(sc.unimode.thin ~ edges +
nodematch("religion") +
nodematch("appointed") +
nodecov("birthyear") +
absdiff("birthyear"),
estimate = "MLE", control = control.ergm("force.main" = TRUE))
# Dyad-dependent terms always force MCMC
sc.ergm.thin.gw = ergm(sc.unimode.thin ~ edges +
nodematch("religion") +
nodematch("appointed") +
nodecov("birthyear") +
absdiff("birthyear") +
#gwesp(.5, fixed = T) +
gwdegree(.5, fixed = T))
# Then we would check MCMC diagnostics and goodness-of-fit
summary(sc.ergm.thin.gw)
mcmc.diagnostics(sc.ergm.thin.gw)
sc.ergm.thin.gw.gof = gof(sc.ergm.thin.gw)
plot(sc.ergm.thin.gw.gof)
# Compare simulated to true
par(mfrow = c(1,2))
plot(sc.unimode.thin,
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
#vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Network of 1st Amendment Decisions")
plot(simulate(sc.ergm.thin.gw),
vertex.col = 2*(-1*sc.unimode%v%"appointed"+4),
vertex.cex = sc.unimode%v%"appearances"/15,
#vertex.sides = religion.pch,
edge.col = "gray",
label = sc.unimode%v%"label",
main = "Simulation from model")
Dyad-dependent terms like the geometrically weighted terms are critical when the network shows clustering, as most social networks do.
Adding a triangles term to the model formula seems like a good idea but will fail due to model degeneracy. ergm will return an error message, e.g. “MCMLE estimation stuck. There may be excessive correlation between model terms, suggesting a poor model for the observed data.”
?"ergm-terms" and plot a simulation from your model
igraph
More algorithmic than model-based.
Lots of descriptive statistics and some models too. Many community detection algorithms and many layout options for plots.
Often better-suited to BIG networks.
igraph has a Python version too.
Where function names clash, use the igraph:: prefix.
igraph (version 1.2.2): Csardi and Nepusz (2006)
Depends: methods; Imports: graphics, grDevices, magrittr, Matrix, pkgconfig (≥ 2.0.0), stats, utils
Author: Gabor Csardi - background in computer science, biophysics, statistics
Analyze a network of statistics journals where connections describe how much one journal cites another.
Data used is in supplementary material from a paper available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4755202/ In this case we don’t have any features about the journals outside the citation counts.
We may be interested in:
Network (igraph) object
Our network comes in the form of a valued adjacency matrix.
We want to retain the information from valued edges wherever possible, instead of reducing it to a binary network.
We remove self-citations by making the diagonal of the adjacency matrix zero. In previous analysis we found that the highest entries in the adjacency matrix are from journals citing themselves, e.g. CSDA (486) and StMed (628). These values could inflate journal importance.
First we load the data and convert it into a few different igraph networks (weighted, binary, undirected)
library(igraph)
Cmatrix = as.matrix(read.csv(file.path(wd, "data/cross-citation-matrix.csv"), row.names = 1)) #47 x 47
Cmatrix.diag = Cmatrix #store a copy before removing diag
diag(Cmatrix) = rep(0,47)
Tmatrix = Cmatrix + t(Cmatrix)
# We transpose the Cmatrix to correspond to standard i,j entry = citation FROM i to j.
# Original Cmatrix has i,j indicating citation from j to i:
Cgraph = graph_from_adjacency_matrix(t(Cmatrix), mode = "directed", weighted = TRUE)
Cgraph.bin = graph_from_adjacency_matrix(t(Cmatrix), mode = "directed", weighted = "citations")
Cgraph.bin.sym = graph_from_adjacency_matrix(Tmatrix, mode = "undirected", weighted = "citations")
Cgraph.sym = graph_from_adjacency_matrix(Tmatrix, mode = "undirected", weighted = TRUE)
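The transpose convention is easy to check on a toy matrix (a base R sketch, not the journal data): if M[i, j] counts citations from j to i, then t(M) has the standard from-i-to-j orientation, and M + t(M) gives the symmetric total between each pair.

```r
# Toy check of the orientation fix above
M <- matrix(c(0, 2,
              5, 0), nrow = 2, byrow = TRUE)  # M[1,2] = 2: journal 2 cites journal 1 twice
t(M)[2, 1]       # 2: citations from journal 2 to journal 1, standard orientation
(M + t(M))[1, 2] # 7: total citations between journals 1 and 2
```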
Then we look at some basic attributes of the networks to compare and check that they were created correctly
edge_density(Cgraph)
## [1] 0.6563367
edge_density(Cgraph.sym)
## [1] 0.8057354
is.weighted(Cgraph)
## [1] TRUE
is.weighted(Cgraph.bin)
## [1] FALSE
Cgraph.weights = igraph::get.edge.attribute(Cgraph, "weight") #notice the name "weight"
#citations is an edge variable in the binary graph, not an edge weight
identical(Cgraph.weights, igraph::get.edge.attribute(Cgraph.bin, "citations"))
## [1] TRUE
igraph has many more layout options than statnet. You can use the intergraph package (Bojanowski (2015)) to easily convert networks back and forth between the two packages (primarily via the functions asIgraph and asNetwork).
We can tailor the layout to our purposes, which may be visibility of nodes, visualization of clusters, or visibility of relationships.
par(mfrow = c(3,3))
par(mai = rep(.2, 4))
arrow.size = .2
edge.width = .2
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.grid, main = "layout.grid")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.circle, main = "layout.circle")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.star, main = "layout.star")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.davidson.harel, main = "layout.davidson.harel")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.graphopt, main = "layout.graphopt")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.mds, main = "layout.mds")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout_nicely, main = "layout_nicely")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.fruchterman.reingold, main = "layout.fruchterman.reingold")
plot(Cgraph, edge.arrow.size = arrow.size, edge.arrow.width = arrow.size, edge.width = edge.width,
layout = layout.gem, main = "layout.gem")
Other ways to change the plot layout or supply a custom layout:
# Another way to change the layout
Cgraph <- set_graph_attr(Cgraph, "layout", layout_with_kk(Cgraph)) #then plotting defaults to this layout
# Producing/extracting layout coordinates
l = igraph::layout_on_grid(Cgraph)
l
## [,1] [,2]
## [1,] 0 0
## [2,] 1 0
## [3,] 2 0
## [4,] 3 0
## [5,] 4 0
## [6,] 5 0
## [7,] 6 0
## [8,] 0 1
## [9,] 1 1
## [10,] 2 1
## [11,] 3 1
## [12,] 4 1
## [13,] 5 1
## [14,] 6 1
## [15,] 0 2
## [16,] 1 2
## [17,] 2 2
## [18,] 3 2
## [19,] 4 2
## [20,] 5 2
## [21,] 6 2
## [22,] 0 3
## [23,] 1 3
## [24,] 2 3
## [25,] 3 3
## [26,] 4 3
## [27,] 5 3
## [28,] 6 3
## [29,] 0 4
## [30,] 1 4
## [31,] 2 4
## [32,] 3 4
## [33,] 4 4
## [34,] 5 4
## [35,] 6 4
## [36,] 0 5
## [37,] 1 5
## [38,] 2 5
## [39,] 3 5
## [40,] 4 5
## [41,] 5 5
## [42,] 6 5
## [43,] 0 6
## [44,] 1 6
## [45,] 2 6
## [46,] 3 6
## [47,] 4 6
plot(Cgraph, layout = l)
Of course we can also change the color, size and other attributes of the vertices and edges. Vertex attributes begin with vertex. and edge attributes begin with edge..
plot(Cgraph, layout = layout.mds, vertex.size = colSums(Cmatrix)/100,
vertex.color = "skyblue", vertex.shape = "csquare", vertex.frame.color = "white",
vertex.label.cex = .5, vertex.label.color = "black", vertex.label.family = "mono",
edge.width = Cgraph.weights/20, edge.color = "orange", edge.curved = .5, edge.arrow.size = .3,
margin = 0, main = "Playing with plot settings")
How can we make the overlapping nodes more visible?
We could do this in a ggplot family package and use ggrepel to force those central nodes apart (Pedersen (2018), Slowikowski (2018)).
library(ggraph)
library(ggrepel)
ggraph(Cgraph, 'igraph', algorithm = 'mds') +
geom_edge_link(colour = "yellow2", alpha = 0.8, show.legend = F) +
geom_node_point(size = colSums(Cmatrix)/100, color = "blue") +
geom_node_label(aes(label = name), color = "blue", repel = TRUE, label.padding = .15) +
ggtitle("ggraph version of the plot using ggrepel") + theme_void()
ggnet, ggnet2 and ggnetwork?
ggnet and ggnet2 are network plotting functions. “ggnet2 is an improved version of ggnet.” Both have been folded into the GGally package. (https://briatte.github.io/ggnet/)
ggnetwork is from the same author as ggnet and ggnet2, Francois Briatte, but is not actively maintained, while ggraph is. This back and forth between the package authors clears things up: https://github.com/thomasp85/ggraph/issues/3.
“It seems we have coded more or less the same thing: https://github.com/briatte/ggnetwork … but your project goes further than mine… Also, my ggnetwork package was written as a proof-of-concept: unless I need to submit it for CRAN to get the paper published, I won’t submit it and just leave it on GitHub, so feel free to take everything you need from it for your own package, if there is anything useful in there for you!” - Francois Briatte.
tldr: ggraph is the go-to “gg” package for networks.
The plot above uses the multidimensional scaling layout, which attempts to preserve in two dimensions some type of distance calculated between nodes in the graph, e.g. the shortest path distance. From this layout we see some central clustering in the plot. We also see that most of the journals with high numbers of citations, calculated by colSums(Cmatrix), are generally near the middle. This leads us to our next question.
And a related question: which journals are most central in the network?
Centrality and importance in a network are highly related, but not identical.
A baseline measure of the centrality of a journal could be its total degree, the sum of its in- and out- degrees. A baseline measure of the importance of a journal could be its in-degree, which in this case means the number of citations it receives from other journals. We will compare several metrics in igraph to these baselines. We focus on ones that apply to directed networks.
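These baselines are just row and column operations on the adjacency matrix. A base R sketch on a hypothetical 3-journal matrix, using the standard convention A[i, j] = citations from i to j:

```r
# Degree/strength baselines from a toy directed, weighted adjacency matrix
A <- matrix(c(0, 3, 1,
              0, 0, 2,
              4, 0, 0), nrow = 3, byrow = TRUE)
in.strength  <- colSums(A)                        # weighted in-degree: citations received
out.strength <- rowSums(A)                        # weighted out-degree: citations made
total.degree <- colSums(A > 0) + rowSums(A > 0)   # unweighted total degree
in.strength
```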
\[ \begin{aligned} x_i' &= \sum_jA_{ij}x_j\\ \bf{x'} &= \bf{\kappa_1x} = \bf{Ax} \\ \end{aligned} \]
\[ \begin{aligned} x_i &= \alpha \sum_jA_{ij}\frac{x_j}{k^{out}_j} + \beta \\ \bf{x} &= \alpha\bf{AD^{-1}x} +\beta\bf{1} \\ \end{aligned} \] where we let \(k^{out}_j\) be 1 where it’s observed to be 0, and \(D\) is a matrix whose entries are zero except the diagonal, where \(D_{ii} = max(1, k^{out}_i)\). There are versions where \(\beta\) is customized for each node, i.e. \(\beta_i\).
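The recursion above can be solved directly as the linear system (I - alpha A D^{-1}) x = beta 1. A base R sketch on a toy 3-node matrix, with A[i, j] = link from j to i to match the formula (hypothetical data, direct solve rather than iteration):

```r
# Solve x = alpha * A %*% D^{-1} %*% x + beta * 1 directly
A <- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 1, 0), nrow = 3, byrow = TRUE)
alpha <- 0.85; beta <- 1
k.out <- colSums(A)                    # out-degree of each node j
D.inv <- diag(1 / pmax(1, k.out))      # D_ii = max(1, k_i^out), inverted
x <- solve(diag(3) - alpha * A %*% D.inv, rep(beta, 3))
# check that x satisfies the fixed-point equation
all.equal(x, as.vector(alpha * A %*% D.inv %*% x + beta))
```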
Degree centrality
The degree function of igraph returns the degree of each vertex, i.e. its number of connections. The strength function returns the weighted degree of each vertex, the sum of its connection weights. The mode argument of each determines, for directed networks, whether you want the in-degree, out-degree or the total.
degree(Cgraph, mode = "in") # = rowSums(Cmatrix>0)
strength(Cgraph, mode = "in") #= rowSums(Cmatrix)
degree(Cgraph, mode = "out") # = colSums(Cmatrix>0)
strength(Cgraph, mode = "out") #= colSums(Cmatrix)
Betweenness centrality
In general, centrality functions on weighted graphs will use the graph’s weight variable by default. We may have to transform it.
# Binary version
cen.between.bin = estimate_betweenness(Cgraph.bin, directed = TRUE, cutoff = Inf, weights = NULL)
# Weighted version, with a decreasing function of the citation weights as the "distances"
cen.between = estimate_betweenness(Cgraph, directed = TRUE, cutoff = Inf, weights = max(log(Cgraph.weights+1)) - log(Cgraph.weights))
# Relationship with corresponding degree centrality
{par(mfrow = c(1,2))
plot(igraph::degree(Cgraph, mode = "total"), cen.between.bin, type = "n")
text(igraph::degree(Cgraph, mode = "total"), cen.between.bin, labels = rownames(Cmatrix), cex = .5)
plot(strength(Cgraph, mode = "total"), cen.between, type = "n")
text(strength(Cgraph, mode = "total"), cen.between, labels = rownames(Cmatrix), cex = .5)}
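To see why the transformation above is a reasonable "distance": it is positive and strictly decreasing in the citation weight, so strongly connected pairs end up close together. A quick base R check on toy weights:

```r
# The weight-to-distance transformation used for betweenness above
w <- c(1, 10, 100)                 # toy citation weights
d <- max(log(w + 1)) - log(w)      # transformed "distances"
d          # decreasing in w, and strictly positive
order(d)   # the strongest tie has the smallest distance
```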
Closeness centrality
The closeness centrality in the “in” direction is a better indicator of the influence of a journal than in the “out” direction.
# Binary version
cen.closeness.bin = estimate_closeness(Cgraph.bin, mode = c("in"), cutoff = Inf, weights = NULL, normalized = FALSE)
# Weighted version, with a decreasing function of the citation weights as the "distances"
cen.closeness = estimate_closeness(Cgraph, mode = c("in"), cutoff = Inf, weights = max(log(Cgraph.weights+1)) - log(Cgraph.weights), normalized = FALSE)
# Relationship with corresponding in-degree centrality
{par(mfrow = c(1,2))
plot(igraph::degree(Cgraph, mode = "in"), cen.closeness.bin, type = "n")
text(igraph::degree(Cgraph, mode = "in"), cen.closeness.bin, labels = rownames(Cmatrix), cex = .5)
plot(strength(Cgraph, mode = "in"), cen.closeness, type = "n")
text(strength(Cgraph, mode = "in"), cen.closeness, labels = rownames(Cmatrix), cex = .5)}
Eigenvector centrality
Eigenvector centrality interprets weights as connection strength, so no transformation of weights is needed.
cen.eigen.bin = eigen_centrality(Cgraph.bin, directed = TRUE)$vector
cen.eigen = eigen_centrality(Cgraph, directed = TRUE)$vector
{par(mfrow = c(1,2))
plot(igraph::degree(Cgraph, mode = "in"), cen.eigen.bin, type = "n", main = "binary")
text(igraph::degree(Cgraph, mode = "in"), cen.eigen.bin, labels = rownames(Cmatrix), cex = .5)
plot(strength(Cgraph, mode = "in"), cen.eigen, type = "n", main = "weighted")
text(strength(Cgraph, mode = "in"), cen.eigen, labels = rownames(Cmatrix), cex = .5)}
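Under the hood, eigenvector centrality is (up to scaling) the leading eigenvector of the adjacency matrix. A base R sketch on a toy symmetric weighted matrix, rescaled so the maximum entry is 1, as igraph does (not the journal data):

```r
# Eigenvector centrality from first principles via eigen()
A <- matrix(c(0, 2, 1,
              2, 0, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)
v <- eigen(A)$vectors[, 1]    # leading eigenvector
cen <- abs(v) / max(abs(v))   # rescale so the most central node scores 1
cen                           # node 1, with the strongest ties, is most central
```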
PageRank
PageRank also interprets weights as connection strength (originally, the number of links to a website).
cen.pagerank.bin = page_rank(Cgraph.bin, directed = TRUE)$vector
cen.pagerank = page_rank(Cgraph, directed = TRUE)$vector
{par(mfrow = c(1,2))
plot(igraph::degree(Cgraph, mode = "in"), cen.pagerank.bin, type = "n", main = "binary")
text(igraph::degree(Cgraph, mode = "in"), cen.pagerank.bin, labels = rownames(Cmatrix), cex = .5)
plot(strength(Cgraph, mode = "in"), cen.pagerank, type = "n", main = "weighted")
text(strength(Cgraph, mode = "in"), cen.pagerank, labels = rownames(Cmatrix), cex = .5)}
How do we choose?
Summary of centrality measures (by resulting journal rank):
library(kableExtra)
cen.compare = data.frame(degree.in = rank(-strength(Cgraph, mode = "in"), ties.method = "first"),
degree.total = rank(-strength(Cgraph, mode = "total"), ties.method = "first"),
betweenness = rank(-cen.between, ties.method = "first"),
closeness = rank(-cen.closeness),
eigenvector = rank(-cen.eigen),
pagerank = rank(-cen.pagerank),
ratio = rank(-strength(Cgraph, mode = "in")/strength(Cgraph, mode = "out")))
kable(cen.compare, caption = "Comparison of centrality measures on the weighted journal network") %>%
kable_styling("striped", full_width = F) %>%
row_spec(c(3,8,19,26), bold = F, color = "white", background = "darkred", font_size = 10) %>%
row_spec((1:47)[-c(3,8,19,26)], bold = F, color = "black", font_size = 10)
| Journal | degree.in | degree.total | betweenness | closeness | eigenvector | pagerank | ratio |
|---|---|---|---|---|---|---|---|
| AmS | 24 | 34 | 10 | 25 | 26 | 26 | 15 |
| AISM | 21 | 30 | 11 | 29 | 25 | 23 | 11 |
| AoS | 2 | 5 | 6 | 2 | 2 | 2 | 2 |
| ANZS | 38 | 41 | 12 | 39 | 36 | 38 | 18 |
| Bern | 19 | 24 | 13 | 17 | 16 | 16 | 14 |
| BioJ | 17 | 15 | 14 | 16 | 17 | 20 | 32 |
| Bcs | 4 | 6 | 5 | 3 | 5 | 5 | 5 |
| Bka | 5 | 7 | 15 | 5 | 4 | 4 | 3 |
| Biost | 12 | 13 | 16 | 9 | 10 | 11 | 13 |
| CJS | 20 | 23 | 17 | 21 | 18 | 18 | 16 |
| CSSC | 28 | 18 | 18 | 33 | 37 | 33 | 46 |
| CSTM | 14 | 12 | 7 | 19 | 21 | 19 | 34 |
| CmpSt | 44 | 42 | 19 | 46 | 43 | 42 | 40 |
| CSDA | 8 | 3 | 3 | 8 | 9 | 8 | 35 |
| EES | 46 | 44 | 20 | 42 | 44 | 44 | 38 |
| Envr | 35 | 35 | 21 | 28 | 30 | 29 | 27 |
| ISR | 41 | 45 | 22 | 40 | 41 | 37 | 12 |
| JABES | 43 | 43 | 23 | 38 | 39 | 41 | 31 |
| JASA | 1 | 1 | 2 | 1 | 1 | 1 | 4 |
| JAS | 34 | 19 | 24 | 36 | 40 | 40 | 47 |
| JBS | 37 | 26 | 25 | 26 | 33 | 36 | 45 |
| JCGS | 13 | 16 | 26 | 13 | 12 | 12 | 10 |
| JMA | 11 | 8 | 9 | 11 | 11 | 10 | 37 |
| JNS | 25 | 22 | 27 | 35 | 27 | 28 | 39 |
| JRSS-A | 29 | 38 | 28 | 18 | 23 | 25 | 7 |
| JRSS-B | 3 | 9 | 29 | 4 | 3 | 3 | 1 |
| JRSS-C | 23 | 28 | 30 | 24 | 24 | 22 | 28 |
| JSCS | 26 | 21 | 31 | 31 | 31 | 30 | 41 |
| JSPI | 7 | 4 | 4 | 7 | 7 | 7 | 30 |
| JSS | 39 | 33 | 32 | 37 | 38 | 35 | 44 |
| JTSA | 32 | 39 | 33 | 34 | 34 | 34 | 8 |
| LDA | 30 | 31 | 34 | 27 | 28 | 27 | 25 |
| Mtka | 27 | 32 | 35 | 32 | 32 | 32 | 17 |
| SJS | 15 | 17 | 36 | 14 | 14 | 15 | 9 |
| StataJ | 47 | 47 | 37 | 41 | 47 | 47 | 21 |
| StCmp | 22 | 25 | 38 | 23 | 20 | 17 | 33 |
| Stats | 36 | 36 | 39 | 43 | 35 | 39 | 26 |
| StMed | 6 | 2 | 1 | 6 | 6 | 6 | 20 |
| SMMR | 33 | 29 | 40 | 22 | 29 | 31 | 36 |
| StMod | 40 | 40 | 41 | 44 | 42 | 43 | 29 |
| StNee | 45 | 46 | 42 | 45 | 45 | 45 | 23 |
| StPap | 42 | 37 | 43 | 47 | 46 | 46 | 43 |
| SPL | 9 | 11 | 8 | 12 | 13 | 13 | 19 |
| StSci | 16 | 14 | 44 | 15 | 15 | 14 | 24 |
| StSin | 10 | 10 | 45 | 10 | 8 | 9 | 22 |
| Tech | 18 | 27 | 46 | 20 | 19 | 21 | 6 |
| Test | 31 | 20 | 47 | 30 | 22 | 24 | 42 |
The ratio column is based on each journal’s ratio of in-citations to out-citations. I’ve illustrated some commonly used centrality measures in igraph, but there are many others.
The CINNA package will list many centrality measures available for a graph and calculate as many of them as you’d like in one function call (Ashtiani, Mirzaie, and Jafari (2018)). However, it doesn’t allow you to set all the parameters that you can set when calling these measures directly from igraph or their other native packages.
library(CINNA)
# see available centralities
cen.proper = proper_centralities(Cgraph)
## [1] "Alpha Centrality"
## [2] "Burt's Constraint"
## [3] "Page Rank"
## [4] "Average Distance"
## [5] "Barycenter Centrality"
## [6] "BottleNeck Centrality"
## [7] "Centroid value"
## [8] "Closeness Centrality (Freeman)"
## [9] "ClusterRank"
## [10] "Decay Centrality"
## [11] "Degree Centrality"
## [12] "Diffusion Degree"
## [13] "DMNC - Density of Maximum Neighborhood Component"
## [14] "Eccentricity Centrality"
## [15] "Harary Centrality"
## [16] "eigenvector centralities"
## [17] "K-core Decomposition"
## [18] "Geodesic K-Path Centrality"
## [19] "Katz Centrality (Katz Status Index)"
## [20] "Kleinberg's authority centrality scores"
## [21] "Kleinberg's hub centrality scores"
## [22] "clustering coefficient"
## [23] "Lin Centrality"
## [24] "Lobby Index (Centrality)"
## [25] "Markov Centrality"
## [26] "Radiality Centrality"
## [27] "Shortest-Paths Betweenness Centrality"
## [28] "Current-Flow Closeness Centrality"
## [29] "Closeness centrality (Latora)"
## [30] "Communicability Betweenness Centrality"
## [31] "Community Centrality"
## [32] "Cross-Clique Connectivity"
## [33] "Entropy Centrality"
## [34] "EPC - Edge Percolated Component"
## [35] "Laplacian Centrality"
## [36] "Leverage Centrality"
## [37] "MNC - Maximum Neighborhood Component"
## [38] "Hubbell Index"
## [39] "Semi Local Centrality"
## [40] "Closeness Vitality"
## [41] "Residual Closeness Centrality"
## [42] "Stress Centrality"
## [43] "Load Centrality"
## [44] "Flow Betweenness Centrality"
## [45] "Information Centrality"
## [46] "Dangalchev Closeness Centrality"
## [47] "Group Centrality"
## [48] "Harmonic Centrality"
## [49] "Local Bridging Centrality"
## [50] "Wiener Index Centrality"
## [51] "Weighted Vertex Degree"
# calculate the first five of them
cen.compare.cinna = calculate_centralities(Cgraph, include = cen.proper[1:5])
Community detection and graph partitioning are two related approaches to dividing a network into meaningful sub-groups. In the latter, the number and size of groups is generally set beforehand, and the goal is to minimize edges between groups. Community detection has the more generally stated goal to “find the natural fault lines” (Newman (2010)). What exactly that means varies by method.
Graph partitioning algorithms generally look to minimize the “cut size,” i.e. the number or weight of the edges that must be cut to separate the graph into two communities of given sizes.
\[\text{cut size }(R) = \sum_{g(i)\neq g(j)}A_{ij},\]
where \(g(i)\) is the group of node \(i\) after the cut. For an undirected graph, a factor of \(1/2\) is included so that each edge is counted only once.
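The formula translates directly to base R. Here is a minimal sketch on a made-up directed adjacency matrix (not the journal data):

```r
# Toy directed adjacency matrix for 4 nodes
A <- matrix(c(0, 1, 1, 0,
              1, 0, 0, 1,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
g <- c(1, 1, 2, 2)  # hypothetical group assignment after the cut

# Sum the edges whose endpoints fall in different groups
cut_size <- sum(A[outer(g, g, "!=")])
cut_size  # 3 edges cross this cut
```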
In igraph the min_cut function is an implementation of this.
min_cut(Cgraph.bin, value.only = FALSE)
## $value
## [1] 4
##
## $cut
## + 4/1419 edges from 07a8b27 (vertex names):
## [1] CSDA ->StataJ JRSS-A->StataJ JSPI ->StataJ StMed ->StataJ
##
## $partition1
## + 46/47 vertices, named, from 07a8b27:
## [1] AmS AISM AoS ANZS Bern BioJ Bcs Bka Biost CJS
## [11] CSSC CSTM CmpSt CSDA EES Envr ISR JABES JASA JAS
## [21] JBS JCGS JMA JNS JRSS-A JRSS-B JRSS-C JSCS JSPI JSS
## [31] JTSA LDA Mtka SJS StCmp Stats StMed SMMR StMod StNee
## [41] StPap SPL StSci StSin Tech Test
##
## $partition2
## + 1/47 vertex, named, from 07a8b27:
## [1] StataJ
min_cut(Cgraph.bin.sym, value.only = FALSE)
## $value
## [1] 17
##
## $cut
## + 17/871 edges from 720bc54 (vertex names):
## [1] AmS --StataJ AoS --StataJ Bcs --StataJ Bka --StataJ CJS --StataJ
## [6] CSDA --StataJ ISR --StataJ JMA --StataJ JRSS.A--StataJ JRSS.B--StataJ
## [11] JRSS.C--StataJ JSPI --StataJ JSS --StataJ LDA --StataJ SJS --StataJ
## [16] StataJ--StMed StataJ--SMMR
##
## $partition1
## + 1/47 vertex, named, from 720bc54:
## [1] StataJ
##
## $partition2
## + 46/47 vertices, named, from 720bc54:
## [1] AmS AISM AoS ANZS Bern BioJ Bcs Bka Biost CJS
## [11] CSSC CSTM CmpSt CSDA EES Envr ISR JABES JASA JAS
## [21] JBS JCGS JMA JNS JRSS.A JRSS.B JRSS.C JSCS JSPI JSS
## [31] JTSA LDA Mtka SJS StCmp Stats StMed SMMR StMod StNee
## [41] StPap SPL StSci StSin Tech Test
Optimal graph partitions may not be useful (as above), so we may want to set the group sizes ourselves. A method that is simple to implement (though more difficult to explain) is based on an eigenvector of the Laplacian of the graph, a matrix formed from the adjacency matrix and the diagonal matrix of nodal degrees (usually \(L = D - A\)). It turns out that this matrix describes graph structure, and the eigenvector corresponding to its second-smallest eigenvalue (the Fiedler vector) is valuable in finding a partition with a small cut size \(R\).
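To make the idea concrete before applying it to the journal data, here is a minimal base-R sketch on a made-up symmetric graph with two obvious clusters joined by a single bridge:

```r
# Two triangles {1,2,3} and {4,5,6} joined by the edge 3-4
A <- matrix(0, 6, 6)
edges <- rbind(c(1,2), c(1,3), c(2,3), c(4,5), c(4,6), c(5,6), c(3,4))
for (e in seq_len(nrow(edges))) A[edges[e,1], edges[e,2]] <- A[edges[e,2], edges[e,1]] <- 1

L <- diag(rowSums(A)) - A            # Laplacian: degree matrix minus adjacency
ev <- eigen(L, symmetric = TRUE)     # eigenvalues returned in decreasing order
fiedler <- ev$vectors[, ncol(A) - 1] # eigenvector of the second-smallest eigenvalue
split(1:6, fiedler > 0)              # the sign of the Fiedler vector recovers the two clusters
```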
# Spectral method
set.seed(30)
N = vcount(Cgraph) #47
eigen_laplace = eigen(laplacian_matrix(Cgraph.sym))
Cgraph.sym.weights = igraph::get.edge.attribute(Cgraph.sym, "weight")
# Spectral method over all sizes
# Compare cuts by group size using modularity, which we'll define later:
mod.spec = 1:N
mod.spec2 = 1:N
mod.random = 1:N
cut.size = 1:N
for(i in 1:N) {
size.first = i
mem.spec = as.numeric(rank(eigen_laplace$vectors[,N-1]) > size.first) + 1
mem.spec2 = as.numeric(rank(-eigen_laplace$vectors[,N-1]) > size.first) + 1 #reverse with negative eigenvector
mem.random = sample(c(rep(1,i), rep(2, N-i)), N, replace = F)
mod.spec[i] = modularity(Cgraph.sym, membership = mem.spec, weights = Cgraph.sym.weights) # igraph modularity treats all networks as undirected
mod.spec2[i] = modularity(Cgraph.sym, membership = mem.spec2, weights = Cgraph.sym.weights)
mod.random[i] = modularity(Cgraph.sym, membership = mem.random, weights = Cgraph.sym.weights)
cut.size[i] = sum(Cmatrix[outer(mem.random, mem.random, "!=")]) #can see it increases for equal-sized groups
}
{plot(1:N, mod.spec, type = "l", ylim = c(-.05,.15), ylab = "modularity", xlab = "size of smaller group")
points(1:N, mod.spec2, col = "green", type ="l", lty = 2)
points(1:N, mod.random, col = "red", type ="l")
legend("topleft", legend = c("spectral partition of laplacian", "complementary spectral partition", "random partition"), col=c(1,3,2), lty=1, cex = .8)}
#Visualize the highest-modularity spectral graph division
size.first = which.max(mod.spec) #26
gp.spectral.membership = as.numeric(rank(eigen_laplace$vectors[,N-1]) > size.first) + 1
plot(Cgraph, vertex.color = 1 + gp.spectral.membership,
layout = layout.fruchterman.reingold,
edge.arrow.width = .3, edge.arrow.size = .3,
vertex.size = igraph::degree(Cgraph.sym)/2.5, #specify igraph to not conflict with sna
vertex.label.cex = .7,
vertex.label.color = "black", vertex.frame.color = "white")
Most common community detection or clustering algorithms expect undirected graphs. Community membership is a symmetric characteristic.
Modularity is another way to formalize what it means to be a “community.” Like cut size it looks to break relatively few edges when assigning communities, but it evaluates “few” relative to expectation. It is a measure of how many more edges occur within communities than we would expect randomly, given the nodal degrees.
\[\text{modularity (Q)} = \frac{1}{2m}\sum_{ij}\left(A_{ij}-\frac{k_ik_j}{2m}\right)\delta(g(i),g(j)), \] where \(m\) is the number of edges in the network, \(k_i\) is the degree of node \(i\), and \(\delta(g(i),g(j))\) is 1 if \(i\) and \(j\) are in the same community and 0 otherwise. The \(\frac{k_ik_j}{2m}\) term represents the expected number (or value) of edges between vertices \(i\) and \(j\) under random rewiring that preserves the nodal degrees (the configuration model).
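The definition can be checked directly in base R. A toy example on a path graph (not the journal data), comparing against the formula rather than igraph’s implementation:

```r
# Path graph 1-2-3-4, split down the middle
A <- matrix(0, 4, 4)
A[1,2] <- A[2,1] <- 1
A[2,3] <- A[3,2] <- 1
A[3,4] <- A[4,3] <- 1
g <- c(1, 1, 2, 2)   # candidate communities {1,2} and {3,4}

k <- rowSums(A)      # degrees
m <- sum(A) / 2      # number of edges
Q <- sum((A - outer(k, k) / (2 * m)) * outer(g, g, "==")) / (2 * m)
Q  # 1/6 for this split
```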
Finding the graph split with the highest modularity is very slow for large graphs. igraph implements several different approximate algorithms.
Louvain modularity maximization (cluster_louvain)
Fast greedy (cluster_fast_greedy)
Leading eigenvector (cluster_leading_eigen)
There are many other ways to measure community structure. For example, blockmodels look for structurally equivalent communities, where nodes in a block relate similarly to other blocks. A full review is beyond our scope, but we mention another flavor of metric, implementable in igraph, for comparison.
Random-walk methods optimize relative to a random walker traversing a graph. These “local” calculations may be faster and more suitable for very large networks. For example, the following in igraph:
Infomap (cluster_infomap)
Walktrap (cluster_walktrap)
The following plots compare the output of different clustering algorithms by coloring the nodes accordingly. For consistency all clustering algorithms are applied to the symmetric version of the journal graph where the edge weights are the total citations between two journals.
library(stringr)
set.seed(30)
l = layout_with_fr(Cgraph)
cl.membership.louvain = cluster_louvain(Cgraph.sym, weights = Cgraph.sym.weights)$membership
cl.membership.fast_greedy = cluster_fast_greedy(Cgraph.sym, weights = Cgraph.sym.weights)$membership
#cl.membership.optimal = cluster_optimal(Cgraph.sym)$membership #deprecated
cl.membership.eigen = cluster_leading_eigen(Cgraph.sym, weights = Cgraph.sym.weights)$membership
cl.membership.infomap = cluster_infomap(Cgraph.sym, e.weights = Cgraph.sym.weights)$membership
cl.membership.walktrap = cluster_walktrap(Cgraph.sym, weights = Cgraph.sym.weights)$membership
# In general these functions default to using the weight attribute of Cgraph.sym, so the weights argument isn't necessary
# A hacky way to relevel the membership vectors so they agree
for (elem in ls()[str_detect(ls(), "^cl\\.membership")]) {
assign(elem, as.numeric(relevel(as.factor(get(elem)), ref = get(elem)[1])))
}
cl.results = list(Louvain = cl.membership.louvain,
Fast_Greedy = cl.membership.fast_greedy,
Eigen = cl.membership.eigen,
Infomap = cl.membership.infomap,
Walktrap = cl.membership.walktrap,
Spectral_graph_partition = gp.spectral.membership)
#Calculate modularity for each output
cl.mods = round(sapply(cl.results, modularity, x = Cgraph.sym, weights = Cgraph.sym.weights), 3)
#plot - label with modularity
{par(mfrow = c(2,3))
par(mai = rep(.2, 4))
for (i in seq_along(cl.results))
{plot(Cgraph.sym, vertex.color = cl.results[[i]], layout = l, cex = 2,
main = paste(names(cl.results)[i], cl.mods[[i]], collapse = ""))}}
igraph has a cluster_optimal function to do an exhaustive search for the highest-modularity communities, but this WILL NOT WORK unless you install the development version of igraph, see: https://github.com/igraph/rigraph
cl.optimal = cluster_optimal(Cgraph.sym, weights = Cgraph.sym.weights)
plot(Cgraph.sym, vertex.color = cl.optimal$membership, layout = l, cex = 2,
main = paste("Optimal modularity", round(cl.optimal$modularity, 3), collapse = ""))
There are lots of functions in igraph to calculate graph statistics. Many, including those below, assume a binary network and don’t account for edge weights.
# Reciprocity
reciprocity(Cgraph, ignore.loops = TRUE, mode = c("default"))
sum((Cmatrix>0) * (t(Cmatrix)>0)) / sum(Cmatrix>0)
reciprocity(Cgraph, ignore.loops = TRUE, mode = c("ratio"))
sum((Cmatrix>0) * (t(Cmatrix)>0))/ sum(Tmatrix>0) #if we know there's at least one connection
# For weighted graphs, a simple correlation measure
cor(as.vector(Cmatrix), as.vector(t(Cmatrix)))
assortativity_degree(Cgraph, directed = T) #treated as unweighted
assortativity_nominal(Cgraph, types = cl.membership.louvain, directed = T)
transitivity(Cgraph, type = "global", weights = Cgraph.weights) #treated as undirected, unweighted
transitivity(Cgraph.bin.sym, type = "global")
There is also the fastnet package, which can do these calculations quickly using parallelization and sampling (Shaikh, Dong, and Castro (2018)). See here for a list of igraph functions: https://igraph.org/r/doc/
latentnet
latentnet (version 2.9.0): Krivitsky and Handcock (2008), Krivitsky and Handcock (2018)
Depends: R (≥ 2.8.0), statnet.common (≥ 3.1.0), network, ergm (≥ 3.9.0)
Imports: sna, mvtnorm, abind, coda (≥ 0.17.1), tools, MASS
Authors: Krivitsky, Handcock, Hoff (model)
A more specialized package whose authors and applications overlap with statnet, but the model has a very different flavor.
My favorite : )
A possible expression for a latent-space model of a valued, directed network.
\[ y_{ij} \sim \text{Poisson}\!\left(\exp(\alpha + X_{ij}\beta - \lVert z_i - z_j\rVert)\right), \]
\[ P(Y=y|\theta) = \prod_{i\neq j} \frac{\exp[-\exp(\alpha + X_{ij}\beta - \lVert z_i - z_j\rVert)]\,\exp(\alpha + X_{ij}\beta - \lVert z_i - z_j\rVert)^{y_{ij}}}{y_{ij}!},\] \[ \log P(Y|\theta) = \sum_{i\neq j} -\exp(\alpha + X_{ij}\beta - \lVert z_i - z_j\rVert) + y_{ij}(\alpha + X_{ij}\beta - \lVert z_i - z_j\rVert) + C,\]
where \(\alpha\) is an intercept term, \(X_{ij}\) is a general expression for node- or edge-specific covariates for edge \(ij\), and \(\beta\) is a vector of coefficients. \(z_i\) is the position of node \(i\) in \(d\)-dimensional space, where the value of \(d\) is set before the model is fit. \(\theta\) is shorthand for the full parameter set. \(\lVert z_i - z_j\rVert\) is the Euclidean distance between \(z_i\) and \(z_j\), so the farther apart two nodes are in the latent space, the lower we expect their edge weight to be. \(C\) is a constant we can ignore when fitting the model.
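As a sketch of how this log-likelihood is evaluated, here is a base-R toy with made-up positions and parameters (not latentnet output; covariates \(X_{ij}\beta\) are omitted for brevity):

```r
set.seed(1)
n <- 5
Z <- matrix(rnorm(n * 2), n, 2)          # hypothetical latent positions (d = 2)
alpha <- 1                                # hypothetical intercept

D <- as.matrix(dist(Z))                   # Euclidean distances ||z_i - z_j||
eta <- alpha - D                          # linear predictor (no covariates here)
lambda <- exp(eta)                        # Poisson rates

Y <- matrix(rpois(n * n, lambda), n, n)   # simulate a valued directed network
diag(Y) <- 0

# Log-likelihood over the off-diagonal (i != j) entries
off <- row(Y) != col(Y)
loglik <- sum(dpois(Y[off], lambda[off], log = TRUE))
loglik
```

The dpois call matches the closed form above term by term: \(-e^{\eta_{ij}} + y_{ij}\eta_{ij} - \log(y_{ij}!)\).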
These models are easy to implement in the package latentnet (Krivitsky and Handcock (2018)), but because of the estimation method they can be slow for networks of hundreds of nodes or more. latentnet allows you to find an approximate maximum likelihood estimate of the parameters quickly by setting tofit = "mle", but you lose the uncertainty information a full fit provides.
The euclidean(d = 2) argument indicates that the latent positions are two-dimensional.
The response argument conveys a valued network.
The rsender and rreceiver terms add random effects for each node that describe in-strength and out-strength.
See ?ergmm.terms for more terms you can add to a latent space model.
latent.srp2 = ergmm(Cnet ~ euclidean(d = 2) + rsender + rreceiver,
response = "citations", family = "Poisson.log", seed = 30,
tofit = c("mle"),
control = ergmm.control(mle.maxit = 100))
A better but slower estimation method is MCMC, which gives you information about uncertainty in the estimated parameters. The following model takes a few minutes to run, so the output is saved in the repository’s output folder. Enter latent.srp2 = readRDS("output/latent_srp2.RDS") in the console to load it.
latent.srp2 = ergmm(Cnet ~ euclidean(d = 2) + rsender + rreceiver,
response = "citations", family = "Poisson.log", seed = 1,
tofit = c("mcmc", "mkl", "procrustes", "mle"),
control = ergmm.control(burnin = 500000, interval = 100,
sample.size = 10000, pilot.runs = 10))
Always check mcmc.diagnostics of your model.
We can plot the resulting positions and their uncertainty using the sample produced when fitting the model.
# Always check mcmc diagnostics
mcmc.diagnostics(latent.srp2)
# make coloring correspond to visNetwork
library(scales)
alpha2 <- 0.6
color1 <- c("deepskyblue3","gold", "red", "green", "deeppink", "darkorchid4", "orange", "black")
journals.cluster <- hclust(d = as.dist(1 - cor(Tmatrix))) #method used in JRSS-A paper @varin16
vc2 <- (latent.srp2$mkl$receiver - latent.srp2$mkl$sender)/2+1
col2 <- alpha(color1[cutree(journals.cluster, h = 0.6)], alpha2)
# plot window
par(mai = c(.1,.1,.1,.1))
par(mfrow = c(2,1))
#### position plot - clusters and their labels are from the paper @varin16 ####
{plot(latent.srp2$mkl$Z, col = col2, cex = vc2, pch = 16, ylim = c(-2,4), xlim = c(-6,5),
bty="n", yaxt="n", xaxt="n", xlab = "", ylab = "")
text(latent.srp2$mkl$Z, labels = rownames(Cmatrix), cex = .5, pos = 3, offset = .3)
legend("topleft", col = alpha(color1, alpha2), pch = 16, legend=c("review", "general", "theory/method", "appl./health", "computational","eco./envir.", "JSS", "StataJ"), cex=0.7, box.col=0)
#### cloud/uncertainty plot ####
n = 47
N = 1000
p = latent.srp2[["sample"]][["Z"]]
s = sample(5000, N)
#null plot:
plot(latent.srp2$mkl$Z, xlab = NA, ylab = NA, bty="n", yaxt="n", xaxt="n", type ="n",
xlim = c(-6,4), ylim = c(-3,5), main = NA)
#add points
for (i in 1:N) {
points(p[s[i],,], col = col2, pch = 15, cex = .2)
}}
Now the communities we found earlier are clear in our visualization:
{plot(latent.srp2$mkl$Z, col = 1 + cl.membership.louvain, cex = vc2, pch = 16,
bty="n", yaxt="n", xaxt="n", xlab = "", ylab = "", ylim = c(-2,3),
main = "Visualization of louvain communities using latent positions")
text(latent.srp2$mkl$Z, labels = rownames(Cmatrix), cex = .5, pos = 3, offset = .3)}
We can de-clutter the visualizations above using an interactive plot. The output is saved in the repository’s output folder as “journals_visNetwork.html”
library(visNetwork)
# data ####
nodes <- data.frame(id = 1:47,
label = Cnet%v%"vertex.names",
title = Cnet%v%"vertex.names", #for tooltip
value = vc2, #conveys node size, on a scale from 0-1
group = rep((cutree(journals.cluster, h = 0.6))))
edges <- data.frame(from=data.frame(as.edgelist(Cnet))$X1,
to=data.frame(as.edgelist(Cnet))$X2)
cite_igraph <- graph.data.frame(edges, directed=TRUE, vertices=nodes)
Z = as.matrix(latent.srp2$mkl$Z)
Z<- -Z[ ,c(2,1)] #to preserve orientation
# Thin the edges
strength = .05 #only include "strong receivers", e.g. receiver accounts for >.05 = 5% of sender's citations sent
# Cmatrix.norm = t(Cmatrix)/citing #entries are % of i's citations to j (column); rows sum to 1
Cstrong = t(Cmatrix)
Cstrong[Cmatrix.norm<strength] = 0
Cnet_strong = as.network(Cstrong, directed=T, matrix.type="a", ignore.eval=F, names.eval="citations")
edges2 <- data.frame(from=data.frame(as.edgelist(Cnet_strong))$X1,
to=data.frame(as.edgelist(Cnet_strong))$X2,
value = as.edgelist(Cnet_strong, as.sna.edgelist = T, attrname = "citations")[,3])
# dynamic plot ####
visNetwork(nodes, edges2, main = "Statistics Journals", submain = "latent space positions") %>%
visIgraphLayout(cite_igraph, layout = "layout.norm",
layoutMatrix = cbind(-Z[,2],Z[,1]), type = "full") %>%
visNodes(shape = "dot",
scaling = list(min = 3, max = 15),
labelHighlightBold = TRUE,
font= list(size = 20, color = "black", strokeWidth = 1)) %>%
visOptions(highlightNearest = list(enabled = TRUE, degree = .5),
nodesIdSelection = FALSE,
selectedBy = "title") %>%
visEdges(shadow = FALSE,
scaling = list(min = 1, max = 15),
arrows = list(to = list(enabled = FALSE, scaleFactor = 3),
middle = list(enabled = TRUE, scaleFactor = 1)),
color = list(inherit = "from", highlight = "red", opacity = .03),
hidden = F,
smooth = list(type = "curvedCW"))
netCoin
The netCoin package (Escobar et al. (2018)) is geared to analyzing co-incidence networks, but has a great tool for interactive plotting that can be used for any relatively small network.
library(netCoin)
# Create a netCoin object for plotting
# First create a coincidence matrix from the sociomatrix
sc.coin = coin(t(as.sociomatrix(sc.bipartite)))
# Create node data frame
# Read node attributes before using them
sc.nodes = read.csv(file.path(wd, "data/GSoCversionOpinions-NODES.csv"))
sc.nodes = data.frame(cbind(name = as.character(sc.nodes$Label), sc.nodes))
tmp = asNodes(sc.coin); tmp$Label = sc.nodes$Label[1:n.judges]
sc.nodes.tmp = sc.nodes[1:n.judges,]
sc.nodes.tmp$name = 1:14 # numeric ids; a character vector didn't work (possible package bug)
sc.nodes.tmp$frequency = tmp$frequency
We’re concerned from earlier that the number of overlapping cases between two justices is influencing their implied similarity.
We can pick a similarity metric that somewhat accounts for this. If neither judge is attached to a decision, it is because neither made that decision, or one or both were not on the bench at that time. We are not interested in the latter case, so we want to focus our attention on the number of times they decided in the same way, relative to the number of times they decided in the same or different ways. For this we can use the Kulczynski similarity metric (Escobar et al. (2018)).
\[\text{Kulczynski} = (\frac{a}{a+b} + \frac{a}{a+c})/2,\]
where \(a, b, c, d\) describe co-occurrence between two events as follows:
|  | present | absent |
|---|---|---|
| present | a | c |
| absent | b | d |
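A minimal base-R check of the formula on two hypothetical attendance vectors (1 = the justice joined the decision; these are made-up data, not the court dataset):

```r
kulczynski <- function(x, y) {
  a  <- sum(x == 1 & y == 1)  # a: both present
  b  <- sum(x == 0 & y == 1)  # b: only the second present
  cc <- sum(x == 1 & y == 0)  # c: only the first present
  # d (joint absences) never enters, which is the point of this metric
  (a / (a + b) + a / (a + cc)) / 2
}

j1 <- c(1, 1, 1, 0, 0, 1, 0, 1)
j2 <- c(1, 0, 1, 1, 0, 1, 0, 0)
kulczynski(j1, j2)  # (3/4 + 3/5)/2 = 0.675
```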
A complete analysis would normalize for the number of possible co-occurrences of any two justices when calculating edge strength or similarity more generally.
# Create a netCoin network object for interactive plotting.
sc.allnet = allNet(incidences = t(as.sociomatrix(sc.bipartite)), nodes = sc.nodes.tmp, procedures = c("f", "i", "ii", "r", "k", "h"))
# The procedures argument tells it which similarity measures to calculate. We can access them in the interactive plot
# Generate interactive plot
plot(sc.allnet)
This should generate a plot in a web browser. We can use the drop-down to compare the various similarity/distance metrics and use the color and width to highlight the strength of relationships.
These are the formulas for the other metrics in the Links drop-down menus. Here the diagonal of the adjacency matrix is equal to the number of appearances of each justice:
As expected, the Kulczynski metric can highlight relationships other than those among the nodes that appear most often. The Haberman metric, and with it the P(Z), can as well, reflecting their standardization. The following table summarizes the correlations between the metrics.
round(cor(sc.allnet[["links"]][,c("coincidences", "sConditional", "tConditional", "Russell", "Kulczynski", "Haberman", "p(Z)")]), 2)
## coincidences sConditional tConditional Russell Kulczynski Haberman
## coincidences 1.00 0.66 0.27 1.00 0.73 0.48
## sConditional 0.66 1.00 -0.20 0.66 0.63 0.53
## tConditional 0.27 -0.20 1.00 0.27 0.64 0.63
## Russell 1.00 0.66 0.27 1.00 0.73 0.48
## Kulczynski 0.73 0.63 0.64 0.73 1.00 0.92
## Haberman 0.48 0.53 0.63 0.48 0.92 1.00
## p(Z) -0.31 -0.38 -0.44 -0.31 -0.64 -0.71
## p(Z)
## coincidences -0.31
## sConditional -0.38
## tConditional -0.44
## Russell -0.31
## Kulczynski -0.64
## Haberman -0.71
## p(Z) 1.00
networkDynamic and ndtv